Abstract. This paper explains the design and implementation of a
high-security elliptic-curve-Diffie-Hellman function achieving record-setting
speeds: e.g., 832457 Pentium III cycles (with several side benefits: free 
key compression, free key validation, and state-of-the-art timing-attack 
protection), more than twice as fast as other authors’ results at the same 
conjectured security level (with or without the side benefits). 
1 Introduction

This paper introduces and analyzes Curve25519, a state-of-the-art
elliptic-curve-Diffie-Hellman function suitable for a wide variety of
cryptographic applications. This paper uses Curve25519 to obtain new speed
records for high-security Diffie-Hellman computations.

Here is the high-level view of Curve25519: Each Curve25519 user has a
32-byte secret key and a 32-byte public key. Each set of two Curve25519 users
has a 32-byte shared secret used to authenticate and encrypt messages between
the two users.

Medium-level view: The following picture shows the data flow from secret 
keys through public keys to a shared secret. 


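[Figure: data flow. Alice’s secret key a and the public string 9 yield Alice’s
public key Curve25519(a, 9); Bob’s secret key b and the public string 9 yield
Bob’s public key Curve25519(b, 9); the shared secret of the two users is
Curve25519(a, Curve25519(b, 9)) = Curve25519(b, Curve25519(a, 9)).]
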
A hash of the shared secret Curve25519(a, Curve25519(b, 9)) is used as the key
for a secret-key authentication system (to authenticate messages), or as the key 
for a secret-key authenticated-encryption system (to simultaneously encrypt and 
authenticate messages). 

Low-level view: The Curve25519 function is $\mathbf{F}_p$-restricted
$x$-coordinate scalar multiplication on $E(\mathbf{F}_{p^2})$, where $p$ is the
prime number $2^{255} - 19$ and $E$ is the elliptic curve
$y^2 = x^3 + 486662x^2 + x$. See Section 2 for further details.

Conjectured security level. Breaking the Curve25519 function—for example, 
computing the shared secret from the two public keys—is conjectured to be 
extremely difficult. Every known attack is more expensive than performing a 
brute-force search on a typical 128-bit secret-key cipher. 

The general problem of elliptic-curve discrete logarithms has been attacked 
for two decades with very little success. Generic discrete-logarithm algorithms 
break prime groups that are not sufficiently large, but the prime group used 
in this paper has size above $2^{252}$. Elliptic curves with certain special algebraic
structures can be broken much more quickly by non-generic algorithms, but
$E(\mathbf{F}_{p^2})$ does not have those structures. See Section 3 of this paper for more
detailed comments on the security of the Curve25519 function. 

If large quantum computers are built then they will break Curve25519 and 
all other short-key discrete-logarithm systems. See [56] for details of a general 
elliptic-curve-discrete-logarithm algorithm. The ramifications of this observation 
are orthogonal to the topic of this paper and are not discussed further. 

Efficiency. My public-domain Curve25519 software provides several efficiency 
features, thanks in large part to the choice of the Curve25519 function: 

• Extremely high speed. My software computes Curve25519 in just 832457 
cycles on a Pentium III, 957904 cycles on a Pentium 4, 640838 cycles on a 
Pentium M, and 624786 cycles on an Athlon. Each of these numbers is a 
new speed record for high-security Diffie-Hellman functions. I am working 
on implementations for the UltraSPARC, PowerPC, etc.; I expect to end up 
with similar cycle counts. 

• No time variability. Most speed reports in the cryptographic literature are 
for software without any protection against timing attacks. See [12], [51], 
and [50] for some successful attacks. Adding protection can dramatically 
slow down the computation. In contrast, my Curve25519 software is already
immune to timing attacks, including hyperthreading attacks and other
cache-timing attacks. It avoids all input-dependent branches, all
input-dependent array indices, and other instructions with input-dependent
timings.

• Short secret keys. The Curve25519 secret key is only 32 bytes. This is 
typical for high-security Diffie-Hellman functions. 

• Short public keys. The Curve25519 public key is only 32 bytes. Typical 
elliptic-curve-Diffie-Hellman functions use 64-byte public keys; those keys 
can be compressed to half size, as suggested by Miller in [46], but the time 
for decompression is quite noticeable and usually not reported. 

• Free key validation. Typical elliptic-curve-Diffie-Hellman functions can be
broken if users do not validate public keys; see, e.g., [14, Section 4.1] and [3].
The time for key validation is quite noticeable and usually not reported. In
contrast, every 32-byte string is accepted as a Curve25519 public key.

• Short code. My software is very small. The compiled code, including all
necessary tables, is around 16 kilobytes on each CPU, and can easily fit,
alongside other networking tools, in the CPU’s instruction cache.

The new speed records are the highlight of this paper. Sections 4 and 5 explain 
the computation of Curve25519 in detail from the bottom up. 

One can improve speed by choosing functions at lower security levels; for
example, dropping from 255 bits down to 160 bits. But, as discussed in Section
3, I can easily imagine an attacker with the resources to break a 160-bit elliptic
curve in under a year. Users should not expose themselves to this risk; they
should instead move up to the comfortable security level of Curve25519.

Of course, when users exchange large volumes of data, their bottleneck is a 
secret-key cryptosystem, and the Curve25519 speed no longer matters. 

Comparison to previous work. There is an extensive literature analyzing the
speed of various implementations of various Diffie-Hellman functions at various
conjectured security levels.

In particular, there have been some reports of high-security elliptic-curve
scalar-multiplication speeds: [17, Table 8] reports 1920000 cycles on a 400 MHz
Pentium II for field size $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$; [33, Table 7]
reports 1740000 cycles on a 400 MHz Pentium II for field size $2^{283}$ using a
subfield curve; [4, Table 4] reports 3086000 cycles on a 1000 MHz Athlon for a
random 256-bit prime field. At a lower security level: [7, Table 3] reports
2650000 cycles on a 233 MHz Pentium MMX for field size $(2^{31} - 1)^6$; [58,
Table 4] reports 4500000 cycles on a 166 MHz Pentium Pro for field size
$(2^{31} - 19)^6$; [26, Table 6] reports 1720000 cycles on an 800 MHz Pentium III
for field size $2^{233}$.

The Curve25519 timings are more than twice as fast as the above reports. The 
comparison is actually even more lopsided than this, because the Curve25519 
timings include free key compression, free key validation, and state-of-the-art 
timing-attack protection, while the above reports do not. 

I have previously reported preliminary implementation work achieving about 
half of this speedup using a standard NIST curve. The other half of the speedup 
relies on switching to a better-designed curve. This paper covers both halves of 
the speedup. 

At a lower level, designing and implementing an elliptic-curve-Diffie-Hellman
function means making many choices that affect speed. Making a few bad choices 
can destroy performance. In the design and implementation of Curve25519 I have 
tried to globally optimize the entire sequence of choices: 

• Use large characteristic, not characteristic 2.

• Use curve shape $y^2 = x^3 + Ax^2 + x$, with $(A - 2)/4$ small, rather than
$y^2 = x^3 - 3x + a_6$.

• Use $x$ as a public key, not $(x, y)$.

• Use a secure curve that also has a secure twist, rather than taking extra time
to prohibit keys on the twist.

• Use $x/z$ inside scalar multiplication, not $(x/z, y/z)$ or $(x/z^2, y/z^3)$.

• Convert variable array indexing into arithmetic.

• Use a fixed position for the leading 1 in the secret key.

• Multiply the secret key by a small power of 2 to account for cofactors in the
curve group and the twist group.

• Use a prime field, not an extension field.

• Use a prime extremely close to $2^b$ for some $b$.

• Use radix $2^{b/w}$ for some $w$, even if $b/w$ is not an integer.

• Allow coefficients slightly larger than the radix, rather than reducing each
coefficient as soon as possible.

• Put coefficients into floating-point registers, not integer registers. Choose $w$
accordingly.

See Sections 4 and 5 for details and credits. Beware that these choices interact 
across many levels of design and implementation: for example, there are other 
curve shapes and prime shapes for which $(x/z^2, y/z^3)$ is better than $x/z$. This
type of interaction makes the optimal sequence of choices difficult to identify 
even when all possible choices are known. 

2 Specification 

This section defines the Curve25519 function. Readers not familiar with rings, 
fields, and elliptic curves should consult Appendix A for definitions and for a 
proof of Theorem 2.1. 

Theorem 2.1. Let $p$ be a prime number with $p > 5$. Let $A$ be an integer such
that $A^2 - 4$ is not a square modulo $p$. Define $E$ as the elliptic curve
$y^2 = x^3 + Ax^2 + x$ over the field $\mathbf{F}_p$. Define
$X_0 : E(\mathbf{F}_{p^2}) \to \mathbf{F}_{p^2}$ as follows: $X_0(\infty) = 0$;
$X_0(x, y) = x$. Let $n$ be an integer. Let $q$ be an element of $\mathbf{F}_p$. Then there
exists a unique $s \in \mathbf{F}_p$ such that $X_0(nQ) = s$ for all
$Q \in E(\mathbf{F}_{p^2})$ such that $X_0(Q) = q$.

In particular, define $p$ as the prime $2^{255} - 19$. Define $\mathbf{F}_p$ as the prime field
$\mathbf{Z}/p = \mathbf{Z}/(2^{255} - 19)$. Note that 2 is not a square in $\mathbf{F}_p$; define
$\mathbf{F}_{p^2}$ as the field $(\mathbf{Z}/(2^{255} - 19))[\sqrt{2}]$. Define $A = 486662$. Note that
$486662^2 - 4$ is not a square in $\mathbf{F}_p$. Define $E$ as the elliptic curve
$y^2 = x^3 + Ax^2 + x$ over $\mathbf{F}_p$. Define a function
$X_0 : E(\mathbf{F}_{p^2}) \to \mathbf{F}_{p^2}$ as follows: $X_0(\infty) = 0$; $X_0(x, y) = x$.
Define a function $X : E(\mathbf{F}_{p^2}) \to \{\infty\} \cup \mathbf{F}_{p^2}$ as follows:
$X(\infty) = \infty$; $X(x, y) = x$.

At this point I could say that, given $n \in 2^{254} + 8\{0, 1, 2, 3, \ldots, 2^{251} - 1\}$
and $q \in \mathbf{F}_p$, the Curve25519 function produces $s$ in Theorem 2.1. However, to
match cryptographic reality and to catch the types of design error explained by 
Menezes in [45], I will instead define the inputs and outputs of Curve25519 as 
sequences of bytes. 

The set of bytes is, by definition, $\{0, 1, \ldots, 255\}$. The encoding of a byte as
a sequence of bits is not relevant to this document. Write $s \mapsto \underline{s}$ for the
standard little-endian bijection from $\{0, 1, \ldots, 2^{256} - 1\}$ to the set
$\{0, 1, \ldots, 255\}^{32}$ of 32-byte strings: in other words, for each integer
$s \in \{0, 1, \ldots, 2^{256} - 1\}$, define
$\underline{s} = (s \bmod 256, \lfloor s/256 \rfloor \bmod 256, \ldots, \lfloor s/256^{31} \rfloor \bmod 256)$.
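The encoding just defined can be sketched in Python as follows. This is a minimal illustration, not part of the specification: the function names `encode` and `decode` are hypothetical, and the comparison with the standard library's `int.to_bytes` is only a sanity check that the formula is the usual little-endian convention.

```python
def encode(s: int) -> bytes:
    """Little-endian 32-byte encoding of an integer s in {0, ..., 2**256 - 1}."""
    if not 0 <= s < 2**256:
        raise ValueError("s must lie in {0, ..., 2**256 - 1}")
    # Byte i is floor(s / 256**i) mod 256, exactly as in the definition.
    return bytes((s // 256**i) % 256 for i in range(32))


def decode(b: bytes) -> int:
    """Inverse of encode: recover the integer from its 32-byte string."""
    if len(b) != 32:
        raise ValueError("expected exactly 32 bytes")
    return sum(byte * 256**i for i, byte in enumerate(b))


if __name__ == "__main__":
    s = 2**255 - 19
    # The formula agrees with Python's built-in little-endian conversion.
    assert encode(s) == s.to_bytes(32, "little")
    assert decode(encode(s)) == s
    print(encode(9).hex())
```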



The set of Curve25519 public keys is, by definition, $\{0, 1, \ldots, 255\}^{32}$; in
other words, $\{\underline{q} : q \in \{0, 1, \ldots, 2^{256} - 1\}\}$. The set of Curve25519 secret keys
is, by definition, $\{0, 8, 16, 24, \ldots, 248\} \times \{0, 1, \ldots, 255\}^{30} \times \{64, 65, 66, \ldots, 127\}$;
in other words, $\{\underline{n} : n \in 2^{254} + 8\{0, 1, 2, 3, \ldots, 2^{251} - 1\}\}$.

Now Curve25519 : {Curve25519 secret keys} $\times$ {Curve25519 public keys} $\to$
{Curve25519 public keys} is defined as follows. Fix $q \in \{0, 1, \ldots, 2^{256} - 1\}$ and
$n \in 2^{254} + 8\{0, 1, 2, 3, \ldots, 2^{251} - 1\}$. By Theorem 2.1, there is a unique integer
$s \in \{0, 1, 2, \ldots, 2^{255} - 20\}$ with the following property: $s = X_0(nQ)$ for all
$Q \in E(\mathbf{F}_{p^2})$ such that $X_0(Q) = q \bmod 2^{255} - 19$. Finally,
Curve25519($\underline{n}$, $\underline{q}$) is defined as $\underline{s}$. Note that Curve25519 is not surjective:
in particular, its final output bit is always 0 and need not be transmitted.
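The definition above can be turned into a short, slow reference sketch in Python. The ladder step used here is the standard x-only Montgomery-ladder formula (the form popularized by RFC 7748), chosen for brevity; it is an assumption of this sketch, not the floating-point implementation described later in this paper. The code branches on secret bits, so it is not constant-time and is unsuitable for real keys. The name `curve25519` and the constant `A24` are illustrative.

```python
P = 2**255 - 19          # the prime p
A = 486662               # curve coefficient in y^2 = x^3 + A x^2 + x
A24 = (A - 2) // 4       # 121665, the small constant (A - 2)/4


def curve25519(n: bytes, q: bytes) -> bytes:
    """Sketch of Curve25519(n, q) on 32-byte strings; NOT constant-time."""
    if len(n) != 32 or len(q) != 32:
        raise ValueError("n and q must be 32-byte strings")
    k = int.from_bytes(n, "little")
    if k % 8 != 0 or not 2**254 <= k < 2**255:
        raise ValueError("n is not in the set of Curve25519 secret keys")
    x1 = int.from_bytes(q, "little") % P      # q mod 2**255 - 19
    # (x2 : z2) tracks X0(mQ) and (x3 : z3) tracks X0((m + 1)Q), with m = 0 at start.
    x2, z2, x3, z3 = 1, 0, x1, 1
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        if bit:                                # secret-dependent branch (sketch only)
            x2, z2, x3, z3 = x3, z3, x2, z2
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        da, cb = d * a % P, c * b % P
        x3 = pow(da + cb, 2, P)                # differential addition
        z3 = x1 * pow(da - cb, 2, P) % P
        x2 = aa * bb % P                       # doubling
        z2 = e * (aa + A24 * e) % P
        if bit:
            x2, z2, x3, z3 = x3, z3, x2, z2
    # x2 / z2 in F_p, with the convention that z2 = 0 gives 0 (X0 of infinity).
    s = x2 * pow(z2, P - 2, P) % P
    return s.to_bytes(32, "little")


if __name__ == "__main__":
    nine = (9).to_bytes(32, "little")
    a = (2**254 + 8 * 12345).to_bytes(32, "little")
    b = (2**254 + 8 * 67890).to_bytes(32, "little")
    shared_ab = curve25519(a, curve25519(b, nine))
    shared_ba = curve25519(b, curve25519(a, nine))
    print(shared_ab == shared_ba)
```

The demonstration at the bottom mirrors the data flow from the introduction: both users arrive at the same 32-byte string.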

3 Security 

This section discusses attacks on Curve25519. The bottom line is that all known 
attacks are extremely expensive. 

Responsibilities of the user. The legitimate users are assumed to generate 
independent uniform random secret keys. A user can, for example, generate 32
uniform random bytes, clear bits 0, 1, 2 of the first byte, clear bit 7 of the last
byte, and set bit 6 of the last byte.
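The key-generation recipe above can be sketched in Python as follows. The function name `generate_secret_key` is hypothetical, and the use of the standard library's `secrets` module as the source of uniform random bytes is an assumption of this sketch.

```python
import secrets


def generate_secret_key() -> bytes:
    """Generate a 32-byte Curve25519 secret key following the recipe in the text."""
    k = bytearray(secrets.token_bytes(32))   # 32 uniform random bytes
    k[0] &= 0b11111000                       # clear bits 0, 1, 2 of the first byte
    k[31] &= 0b01111111                      # clear bit 7 of the last byte
    k[31] |= 0b01000000                      # set bit 6 of the last byte
    return bytes(k)


if __name__ == "__main__":
    key = generate_secret_key()
    n = int.from_bytes(key, "little")
    # The integer n lies in 2**254 + 8 * {0, 1, ..., 2**251 - 1}.
    print(n % 8 == 0 and 2**254 <= n < 2**255)
```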

Large deviations from uniformity can eliminate all security. For example, if 
the first 16 bytes of the secret key n were instead chosen as a public constant, 
then a moderately large computation would deduce the remaining bytes of n 
from the public key Curve25519(n, 9). This is not Curve25519’s fault; the user 
is responsible for putting enough randomness into keys. 

Legitimate users are also assumed to keep their secret keys secret. This means 
that a secret key n is not used except to compute the public key Curve25519(n, 9) 
and to compute the shared-secret hash H(Curve25519(n, q)) given q.

Users are not assumed to throw n away after a single q. Diffie-Hellman secret
keys can (and, for efficiency, should) be reused with many public keys, as in
[23, Section 3]. Each user’s secret key n is combined with many other users’
public keys $q_1, q_2, q_3, \ldots$, producing shared-secret hashes
H(Curve25519(n, $q_1$)), H(Curve25519(n, $q_2$)), H(Curve25519(n, $q_3$)), ....

Choice of key-derivation function. There are no theorems guaranteeing the 
safety of any particular key-derivation function H with, e.g., 512-bit output. 
Some silly choices of H are breakable. As an extreme example, if H outputs just 
64 bits followed by all zeros, then an attacker can perform a brute-force search 
for those 64 bits. 

On the other hand, from the perspective of a secret-key cryptographer, it
seems very easy to design a safe function H. A small amount of mixing, far less 
than necessary to make a safe secret-key cipher, stops all known attacks. 

For concreteness I will define $H(x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7)$ as the 64-byte
string Salsa20$(c_0, x_0, 0, x_1, x_2, c_1, x_3, 0, 0, x_4, c_2, x_5, x_6, 0, x_7, c_3)$.
Here Salsa20 is the function defined in [13, Section 8]; $(c_0, c_1, c_2, c_3)$ is
“Curve25519output” in ASCII; and each $x_i$ has 4 bytes.

If fewer than 64 bytes are needed then the Salsa20 output can simply be 
truncated. If more than 64 bytes are needed then Salsa20 can be invoked again 
with $(c_0, x_0, 1, x_1, \ldots)$ to produce another 64 bytes.
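The arrangement of the sixteen 4-byte words fed to Salsa20 can be sketched as follows. This sketch only builds the 64-byte input block and does not implement Salsa20 itself. The function name `h_input_block` is hypothetical, and encoding the block counter (0 for the first 64 bytes, 1 for the next) as a little-endian 4-byte word is an assumption of this sketch rather than something stated in the text.

```python
CONSTANT = b"Curve25519output"   # (c0, c1, c2, c3), 4 bytes each


def h_input_block(shared_secret: bytes, counter: int = 0) -> bytes:
    """Arrange (c0, x0, ctr, x1, x2, c1, x3, 0, 0, x4, c2, x5, x6, 0, x7, c3)."""
    if len(shared_secret) != 32:
        raise ValueError("shared secret must be 32 bytes")
    x = [shared_secret[4 * i:4 * i + 4] for i in range(8)]   # x0 .. x7
    c = [CONSTANT[4 * i:4 * i + 4] for i in range(4)]        # c0 .. c3
    zero = bytes(4)
    ctr = counter.to_bytes(4, "little")   # assumption: little-endian counter word
    words = [c[0], x[0], ctr, x[1],
             x[2], c[1], x[3], zero,
             zero, x[4], c[2], x[5],
             x[6], zero, x[7], c[3]]
    return b"".join(words)


if __name__ == "__main__":
    block = h_input_block(bytes(range(32)))
    print(block.hex())
```

The resulting 64-byte block would then be passed to an implementation of the Salsa20 function from [13, Section 8].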

Powers of the attacker. An attacker sees public keys $q_1$ = Curve25519($n_1$, 9),
$q_2$ = Curve25519($n_2$, 9), ... generated from the legitimate users’ independent
uniform random secret keys $n_1, n_2, \ldots$.

The attacker also sees messages protected by a secret-key cryptosystem C
where the keys for C are the shared-secret hashes H(Curve25519($n_i$, $q_j$)) =
H(Curve25519($n_j$, $q_i$)) for various sets $\{i, j\}$. The attacker’s goal is to decrypt
or forge these messages.

The attacker can also compute a public key $q' \notin \{q_1, q_2, \ldots\}$ and, by using
$q'$ in the Diffie-Hellman protocol, see messages protected by C where the keys
for C are H(Curve25519($n_1$, $q'$)), H(Curve25519($n_2$, $q'$)), .... This would be
pointless if the attacker generated $q'$ in the normal way, but the attacker is not
required to generate $q'$ in the normal way; legitimate users are not assumed to
check that $q'$ was generated from a secret key, let alone a secret key known to
the attacker. The attacker might take $q' = 1$, for example, or $q' = q_1 \oplus 1$. The
attacker can adaptively generate many public keys $q'$.

Of course, security depends on the choice of secret-key cryptosystem C. One 
could make a poor choice of C, allowing messages to be decrypted or forged 
without any weakness in Curve25519. But standard choices of C are conjectured 
to be safe. Further discussion of the choice of C is outside the scope of this 
document. 

Simplified attack notions. There are many papers using simpler models of 
Diffie-Hellman attackers, and proving theorems of the form “a fast attack in 
complicated-security-model implies a fast attack in simplified-security-model.” 
The reader might wonder why I am not using one of these simplified notions. 

Example: Bentahar in [10], improving an algorithm by Muzereau, Smart,
and Vercauteren in [48] based on an idea by Maurer in [44], showed that one
can evaluate discrete logarithms on typical elliptic curves using roughly $2^{13}$ calls
to a reliable oracle for the function $(mQ, nQ) \mapsto mnQ$. Bentahar then repeated
the standard conjecture that computing discrete logarithms on a typical 256-bit
elliptic curve costs at least $2^{128}$ (never mind the question of exactly what “cost”
means), and deduced the conjecture that computing $(mQ, nQ) \mapsto mnQ$ costs at
least $2^{115}$. Why, then, should one make a conjecture regarding the difficulty of
computing $(mQ, nQ) \mapsto mnQ$, rather than a simplified conjecture regarding the
difficulty of computing discrete logarithms?

Answer: A standard conjecture says that computing $(mQ, nQ) \mapsto mnQ$ costs
at least $2^{128}$. This conjecture is quantitatively stronger than anything that can
be obtained by applying Bentahar’s theorem to a simplified conjecture.

Similar comments apply to other theorems of this type; see, e.g., [39, Section
3.2]. Often the theorems are so weak that they say nothing about any real-world
system. To focus attention on the security properties that applications actually
need, I have chosen to make a complicated but strong conjecture about security,
rather than a simplified but weak conjecture.

Generic discrete logarithms by the rho and kangaroo methods. The
attacker can expand Curve25519(n, 9) into a point $(x, y)$ on $E(\mathbf{F}_{p^2})$, namely the
nth multiple of the base point (9, ...). The attacker can then use Pollard’s rho
method or Pollard’s kangaroo method to compute the discrete logarithm of this
point, namely n. The main cost in either method is the cost of performing a huge
number of additions of elliptic-curve points; both methods are almost perfectly
parallelizable, with negligible communication costs. See [63], [55], [61], and [60].

The number of additions here is about the square root of the length of the
n interval: in this case, about $2^{125}$. The computation can finish after far fewer
additions, but the success chance is at most (and conjecturally at least) about
$a^2/2^{251}$ after $a$ additions.

How many elliptic-curve additions can an attacker perform? The traditional
estimate is roughly $2^{70}$ elliptic-curve additions: a modern CPU costs about $2^6$
dollars; a modern CPU cycle is about $2^{-31}$ seconds; each elliptic-curve addition
in the rho or kangaroo method costs about $2^{10}$ CPU cycles for roughly $2^2$ field
multiplications that each cost $2^8$ cycles; the attacker is willing to spend a year,
i.e., $2^{25}$ seconds; the attacker can afford to spend $2^{30}$ dollars.

I don’t agree with the traditional estimate. I agree that modern circuitry
takes about $2^{-21}$ seconds for a single rho/kangaroo step; but it is a huge error to
assume that this circuitry costs as much as $2^6$ dollars. One can fit many parallel
rho/kangaroo circuits into the same amount of circuitry as a modern CPU. A
reasonable estimate for “many” is $2^{10}$; see [28] for a fairly detailed chip design,
and [28, Section 5.2] for the estimate. By switching to this chip, the attacker
can perform roughly $2^{80}$ elliptic-curve additions. The attacker has an excellent
chance of computing a 160-bit discrete logarithm, but only about a $2^{-90}$ chance
of computing a 251-bit discrete logarithm.
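The arithmetic behind these estimates can be reproduced with a few lines of Python working in base-2 logarithms. This is a back-of-the-envelope sketch using only the figures quoted above; the function and parameter names are hypothetical. With $a = 2^{80}$ the formula $a^2/2^{251}$ evaluates to $2^{-91}$, which is in line with the rough figure of $2^{-90}$ given in the text.

```python
def log2_additions(log2_dollars: int = 30,
                   log2_chip_cost: int = 6,
                   log2_cycles_per_second: int = 31,
                   log2_cycles_per_addition: int = 10,
                   log2_seconds: int = 25,
                   log2_circuits_per_chip: int = 0) -> int:
    """log2 of the number of elliptic-curve additions the attacker can afford."""
    log2_chips = log2_dollars - log2_chip_cost
    log2_rate = log2_cycles_per_second - log2_cycles_per_addition  # additions/second
    return log2_chips + log2_circuits_per_chip + log2_rate + log2_seconds


def log2_success_chance(log2_a: int, log2_interval: int = 251) -> int:
    """log2 of the success chance a^2 / 2^interval, capped at probability 1."""
    return min(0, 2 * log2_a - log2_interval)


if __name__ == "__main__":
    traditional = log2_additions()                         # one circuit per CPU
    revised = log2_additions(log2_circuits_per_chip=10)    # 2^10 circuits per chip
    print(traditional, revised)
    print(log2_success_chance(revised, 160), log2_success_chance(revised, 251))
```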

Of course, one must adjust these estimates as chip technology improves. It 
is not enough to account for increases in cycle speed and for decreases in chip 
cost; one must also account for increases in chip size. However, the Curve25519 
security level will remain comfortable for the foreseeable future. 

Batch discrete logarithms. Silverman and Stapleton observed, and Kuhn
and Struik proved in [41, Section 4] assuming standard conjectures, that the rho
method can compute $u$ discrete logarithms using about $\sqrt{u}$ times as much effort
as computing a single discrete logarithm.

For example, given public keys Curve25519($n_1$, 9), ..., Curve25519($n_u$, 9), the
attacker can discover most of the secret keys $n_1, \ldots, n_u$ using only about
$2^{125}\sqrt{u}$ additions, i.e., about $2^{125}/\sqrt{u}$ additions per key.

This does not mean, however, that one of the keys will be found within
the first $2^{125}/\sqrt{u}$ additions. On the contrary: the attacker is likely to wait for
$2^{125}$ additions before finding the first key, then another $2^{125}(\sqrt{2} - 1)$ additions
before finding the second key, etc. Curve25519 is at a comfortable security level
where finding the first key is, conjecturally, far out of reach, so the reduced cost
of finding subsequent keys is not a threat. The attacker can perform only $2^{125}\epsilon$
additions for small $\epsilon$, so the attacker’s chance of success (of finding any keys) is
only about $\epsilon^2$.

Generic discrete logarithms are often claimed to be about as difficult as
brute-force search for a half-size key. But brute-force search computes a batch of
$u$ keys with about the same effort as computing a single key. Furthermore,
brute-force search has probability roughly $u\epsilon$ of finding some key after the first
$\epsilon$ of the computation, whereas discrete logarithms have only an $\epsilon^2$ chance.
Evidently generic discrete logarithms are more difficult than brute-force search
for a half-size key: $u\epsilon$ is much larger than $\epsilon^2$, except in the extreme case where
$u$ and $\epsilon$ are both close to 1.
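The batch figures and the comparison with brute-force search can be illustrated numerically. The sketch below simply evaluates the approximate formulas quoted above, with costs measured in units of $2^{125}$ additions; all function names are hypothetical.

```python
from math import sqrt


def additions_until_kth_key(k: int) -> float:
    """Approximate additions (in units of 2**125) before the k-th key is found."""
    return sqrt(k)


def extra_additions_for_kth_key(k: int) -> float:
    """Additional work for the k-th key once k - 1 keys are known."""
    return sqrt(k) - sqrt(k - 1)


def rho_success_chance(eps: float) -> float:
    """Chance of finding any key after 2**125 * eps additions."""
    return eps ** 2


def brute_force_success_chance(eps: float, u: int) -> float:
    """Chance that brute force finds some of u keys after a fraction eps of the search."""
    return min(1.0, u * eps)


if __name__ == "__main__":
    print(extra_additions_for_kth_key(1), extra_additions_for_kth_key(2))
    eps, u = 2.0 ** -30, 2 ** 10
    print(rho_success_chance(eps), brute_force_success_chance(eps, u))
```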

Small-subgroup attacks. If the subgroup of $E(\mathbf{F}_{p^2})$ generated by the base
point (9, ...) has non-prime order then the attacker can use the Pohlig-Hellman
method to save time in computing discrete logarithms. See, e.g., [5, Section 19.3].

This attack fails against Curve25519. The order of the base point is a prime,
namely $2^{252} + 27742317777372353535851937790883648493$.

An active attacker has more options. Say there is a point $(x, y) \in E(\mathbf{F}_{p^2})$
of order $b$, with $x \in \mathbf{F}_p$ and with $b$ not very large. The attacker can issue a
public key $x$. The legitimate user will then authenticate and encrypt data under
$H(\mathrm{Curve25519}(n, x)) = H(X_0(n(x, y))) = H(X_0((n \bmod b)(x, y)))$; the attacker
can compare the results to all possibilities for $n \bmod b$, presumably determining
$n \bmod b$.

The active attack also fails against Curve25519. The group
$\{\infty\} \cup (E(\mathbf{F}_{p^2}) \cap (\mathbf{F}_p \times \mathbf{F}_p))$ has size $8p_1$, where
$p_1 = 2^{252} + \cdots$ is the prime number displayed above. The “twist” group
$\{\infty\} \cup (E(\mathbf{F}_{p^2}) \cap (\mathbf{F}_p \times \sqrt{2}\,\mathbf{F}_p))$ has size
$2(p + 1) - 8p_1 = 4p_2$, where $p_2$ is the prime
$2^{253} - 55484635554744707071703875581767296995$.
Consequently, the only possibilities for $b$ below $2^{252}$ are 1, 2, 4, 8. Secret keys $n$
by definition have $n \bmod 8 = 0$ and thus $n \bmod b = 0$.
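The numerical claims in this argument are easy to check with a short Python sketch. It verifies the identity $8p_1 + 4p_2 = 2(p + 1)$, runs a Miller-Rabin test (a probabilistic check, not a primality proof) on $p$, $p_1$, and $p_2$, and lists the divisors of the two group orders that lie below $2^{252}$. The helper names are hypothetical.

```python
P = 2**255 - 19
P1 = 2**252 + 27742317777372353535851937790883648493
P2 = 2**253 - 55484635554744707071703875581767296995


def is_probable_prime(n: int, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23)) -> bool:
    """Miller-Rabin test with fixed bases; probabilistic, not a proof."""
    if n < 2:
        return False
    for b in bases:
        if n % b == 0:
            return n == b
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True


def divisors_below(cofactor_bits: int, prime: int, bound: int) -> set:
    """Divisors of 2**cofactor_bits * prime that are smaller than bound."""
    return {2**i * prime**j
            for i in range(cofactor_bits + 1) for j in (0, 1)
            if 2**i * prime**j < bound}


def small_orders() -> list:
    """Possible point orders b below 2**252 on the curve (8*P1) and twist (4*P2)."""
    return sorted(divisors_below(3, P1, 2**252) | divisors_below(2, P2, 2**252))


if __name__ == "__main__":
    print(8 * P1 + 4 * P2 == 2 * (P + 1))
    print(is_probable_prime(P), is_probable_prime(P1), is_probable_prime(P2))
    print(small_orders())
```

Every possible small order divides 8, which is why secret keys that are multiples of 8 leak nothing through $n \bmod b$.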

History: Lim and Lee in [42] pointed out active attacks on Diffie-Hellman
in the group $\mathbf{F}_p^*$. They recommended in [42, Section 4] that, rather than taking
the time to test that public keys are in a particular subgroup of prime order $q$,
one choose a prime $p$ such that “each prime factor of $(p - 1)/2q$ is larger than
$q$.” Biehl, Meyer, and Müller in [14, Section 4.1] pointed out analogous attacks
on elliptic curves when public keys are represented as pairs $(x, y)$; they did not
propose any workaround other than testing keys. In a November 2001 sci.crypt
posting I wrote “You can happily skip both the y transmission and the square
root. In fact, if both the curve and its twist have nearly prime order, then you
can even skip square testing.”

Other attacks. The kangaroo method actually searches simultaneously for $n/8$
and $p_1 - n/8$ in an interval. The range of $n/8$ is $\{2^{251}, \ldots, 2^{252} - 1\}$, so either
$n/8$ or $p_1 - n/8$ is in the range $\{(p_1 + 1)/2, \ldots, 2^{252} - 1\}$. However, $p_1$ is only
marginally above $2^{252}$, so this range has length only marginally below $2^{251}$.

More generally, when a group $G$ has an easily computed automorphism $\varphi$ of
small order $b$, one can apply the kangaroo method to the orbits of $\varphi$, using only
about $\sqrt{\#G/b}$ steps rather than $\sqrt{\#G}$ steps. See, e.g., [5, Section 19.5.5]. But
my elliptic curve has no structure of this type other than negation. In fact, it
has no complex endomorphisms of small norm. To prove this, compute the trace
$t = p + 1 - 8p_1$, and observe that $t^2 - 4p$ is not a small multiple of a square: it
is divisible once by the prime 8312956054562778877481, for example.

My elliptic curve also resists the transfer attacks surveyed in [30, Chapter
22]. The primes $p_1$ and $p_2$ do not equal the field characteristic $p$. The order of $p$
modulo $p_1$ is not small: in fact, it is $(p_1 - 1)/6$. The order of $p$ modulo $p_2$ is not
small: in fact, it is $p_2 - 1$. Weil descent simply splits $E(\mathbf{F}_{p^2})$ into the subgroup
$E(\mathbf{F}_p)$, of order $8p_1$, and the twist, of order $4p_2$; there are no proper subfields of
$\mathbf{F}_p$ to exploit.

4 Fast arithmetic modulo $2^{255} - 19$

This section explains one way to use common CPU instructions, specifically 
floating-point instructions, to quickly multiply and add in the field $\mathbf{F}_p$ where
$p = 2^{255} - 19$. I will focus on the Pentium M for concreteness, but the same
techniques work well for a wide variety of CPUs. This section also discusses the 
choice of field structure and the choice of prime. 

In this section, “floating-point” is abbreviated “fp.” 

Representing integers modulo $2^{255} - 19$. Define $R$ as the ring of polynomials
$\sum_i u_i x^i$ where $u_i$ is an integer multiple of $2^{\lceil 25.5i \rceil}$. One way to see that $R$ is a
ring is to observe that it is the intersection of the subrings $\mathbf{Z}[x]$ and
$\overline{\mathbf{Z}}[2^{25.5}x]$ of $\overline{\mathbf{Z}}[x]$, where $\overline{\mathbf{Z}}$ is the ring of algebraic integers in $\mathbf{C}$.

Elements of $R$ represent elements of $\mathbf{Z}/(2^{255} - 19)$: each polynomial represents
its value at 1. Often a polynomial is chosen to meet two restrictions:

• The polynomial degree is small, to limit the number of coefficients that need
to be multiplied as part of polynomial multiplication. Specifically,
reduced-degree polynomials have degree at most 9.

• Each coefficient $u_i$ is a small multiple of $2^{\lceil 25.5i \rceil}$, to limit the effort of
multiplying coefficients. Specifically, reduced-coefficient polynomials have
$u_i/2^{\lceil 25.5i \rceil} \in \{-2^{25}, -2^{25} + 1, \ldots, -1, 0, 1, \ldots, 2^{25} - 1, 2^{25}\}$.

To summarize: A reduced-degree reduced-coefficient polynomial is a polynomial
$u_0 + u_1 x + \cdots + u_9 x^9$ with $u_0/2^0$, $u_1/2^{26}$, $u_2/2^{51}$, $u_3/2^{77}$, $u_4/2^{102}$,
$u_5/2^{128}$, $u_6/2^{153}$, $u_7/2^{179}$, $u_8/2^{204}$, $u_9/2^{230}$ all in
$\{-2^{25}, -2^{25} + 1, \ldots, -1, 0, 1, \ldots, 2^{25} - 1, 2^{25}\}$.
This polynomial represents the integer $u_0 + u_1 + \cdots + u_9$.

Note that integers are not converted to a unique “smallest” representation 
until the end of the Curve25519 computation. Producing reduced representations 
is generally much faster than producing “smallest” representations. 
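The representation can be made concrete with exact Python integers standing in for the floating-point coefficients. The sketch below is one possible way to produce a reduced-degree reduced-coefficient polynomial for a given integer; it is an illustration under that assumption, not the conversion used in my software, and the names `to_limbs` and `is_reduced` are hypothetical. Each list entry is the coefficient $u_i$ itself, so the represented integer is simply the sum of the list.

```python
P = 2**255 - 19
EXP = [(51 * i + 1) // 2 for i in range(10)]   # ceil(25.5 * i): 0, 26, 51, ..., 230


def to_limbs(x: int) -> list:
    """Return [u_0, ..., u_9] with u_i a multiple of 2**EXP[i] and sum(u) = x mod p."""
    x %= P
    limbs = []
    for i in range(9):
        w = EXP[i + 1] - EXP[i]            # 26 or 25 bits for this position
        r = x % (1 << w)
        if r > (1 << (w - 1)):             # use a signed digit to stay within +-2**25
            r -= 1 << w
        x = (x - r) >> w                   # exact division by 2**w
        limbs.append(r * 2**EXP[i])
    limbs.append(x * 2**EXP[9])            # whatever remains goes to the x^9 coefficient
    return limbs


def is_reduced(limbs: list) -> bool:
    """Check the reduced-degree reduced-coefficient conditions from the text."""
    return len(limbs) == 10 and all(
        u % 2**e == 0 and abs(u) <= 2**(25 + e) for u, e in zip(limbs, EXP))


if __name__ == "__main__":
    u = to_limbs(2**255 - 20)
    print(is_reduced(u), sum(u) == 2**255 - 20)
```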

Representing coefficients inside CPUs. The Pentium M has eight “fp
registers,” each of which holds a real number $2^e f$ for integers $e$ and $f$ with
$f \in \{-2^{64}, \ldots, 2^{64}\}$ and with $e$ in an adequate range for all the computations
discussed here. My computations hold polynomial coefficients in fp registers to
the extent possible, as in [11, Section 4].

The Pentium M has many more “L1-cache doublewords” that can hold $2^e f$
with $f$ limited to the range $\{-2^{53}, \ldots, 2^{53}\}$; e.g., reduced coefficients. To perform
arithmetic on numbers in L1-cache doublewords, the Pentium M must take time
to copy (“load”) the numbers into registers; but this is not a big problem, because
these loads can be overlapped with arithmetic if they are not too frequent.

Why split 255-bit integers into ten 26-bit pieces, rather than nine 29-bit
pieces or eight 32-bit pieces? Answer: The coefficients of a polynomial product
do not fit into the Pentium M’s fp registers if pieces are too large. The cost of
handling larger coefficients outweighs the savings of handling fewer coefficients.
The overall time for 29-bit pieces is sufficiently competitive to warrant further
investigation, but so far I haven’t been able to save time this way. I’m sure that
32-bit pieces, the most common choice in the literature, are a bad idea.

Of course, the same question must be revisited for each CPU. The Pentium 1,
Pentium MMX, Pentium Pro, Pentium II, Pentium III, Pentium 4, Athlon, and
Athlon XP work well with 26-bit pieces; on the Athlon 64 and Opteron, 32-bit
pieces might be slightly better. On the UltraSPARC and PowerPC, fp registers
use $\{-2^{53}, \ldots, 2^{53}\}$ rather than $\{-2^{64}, \ldots, 2^{64}\}$, and I recommend twelve
22-bit pieces. The UltraSPARC and PowerPC can overlap fp additions with fp
multiplications, so I expect them to end up with comparable cycle counts to the
Pentium M despite the larger number of pieces.

Given that there are 10 pieces, why use radix $2^{25.5}$ rather than, e.g., radix
$2^{25}$ or radix $2^{26}$? Answer: My ring $R$ contains $2^{255}x^{10} - 19$, which represents
0 in $\mathbf{Z}/(2^{255} - 19)$. I will reduce polynomial products modulo $2^{255}x^{10} - 19$ to
eliminate the coefficients of $x^{10}$, $x^{11}$, etc. With radix $2^{25}$, the coefficient of $x^{10}$
could not be eliminated. With radix $2^{26}$, coefficients would have to be multiplied
by $2^5 \cdot 19$ rather than just 19, and the results would not fit into an fp register.

Using floating-point operations. The Pentium M has circuits for three fast
operations on numbers stored in fp registers: sum, difference, and product. These
are exact operations if the results fit into the 64-bit fp precision; otherwise the
results are rounded to the nearest fp numbers.

The Pentium M can perform, at best, one fp operation per cycle. About 92% 
of the cycles in my Curve25519 computation (589825 out of 640838) are occupied 
by fp operations. One can understand the cycle counts fairly well by simply 
counting the fp operations. Similar comments apply to other CPUs, although 
the details depend on the CPU. 

Warning: Writing an fp program in the C programming language, and feeding
the result to a C compiler, often produces machine language that takes 3 or more
Pentium M cycles for each fp operation. Further discussion of this phenomenon 
is outside the scope of this paper. My Curve25519 software is actually written 
in qhasm, a new programming language designed for high-speed computations. 

Beware that a few CPUs have input-dependent fp timings. An old example 
is the Sun microSPARC-IIep. A newer example is the IBM PowerPC RS64 IV, 
which takes an extra cycle to multiply by 0. Fast constant-time computations 
on these CPUs need extra effort. 



Adding integers modulo $2^{255} - 19$. If two integers are represented by two
polynomials u and v then the sum of the two integers is represented by u + v. 
Similarly, the difference of the two integers is represented by u — v. 

If u and v are reduced-degree reduced-coefficient polynomials then computing 
u+v (or u—v) involves 10 additions (or subtractions) of fp numbers. Note that the 
sum is reduced-degree but usually not reduced-coefficient. In a long chain of sums 
one would occasionally have to take extra time to reduce the coefficients. This 
is never necessary in the Curve25519 computation: every sum (and difference) 
is used solely as input to products, as Appendix B illustrates. 

Statistics: Each addition or subtraction takes 10 fp operations. There are 
8 additions and subtractions, totalling 80 fp operations, in each iteration of 
the Curve25519 main loop. There are 2040 additions and subtractions, totalling 
20400 fp operations, in the entire Curve25519 computation. 

Multiplying integers modulo $2^{255} - 19$. If two integers are represented by
polynomials $u$ and $v$ then their product is represented by the polynomial product
$uv$. If $u$ and $v$ are reduced-degree reduced-coefficient polynomials, or sums of
two such polynomials, then computing $uv$ in the simplest way involves 100 fp
multiplications and 81 fp additions; I am experimenting with other
polynomial-multiplication algorithms and expect to end up with slightly better
results. The product $uv$ is then replaced by a reduced-degree reduced-coefficient
polynomial:

• The coefficients of $x^{10}, x^{11}, \ldots, x^{18}$ in $uv$ are eliminated by reduction modulo
$2^{255}x^{10} - 19$. For example, the coefficient of $x^{18}$ is multiplied by $19 \cdot 2^{-255}$ and
added to the coefficient of $x^8$. Each reduction involves 1 fp multiplication
and 1 fp addition.

• The “high” part of each coefficient is subtracted from that coefficient and 
added (“carried”) to the next coefficient. The high part is, by definition, the 
nearest multiple of the power of 2 for the next coefficient. One carry involves 
4 fp additions: 2 to identify the high part (by a rounded addition and then 
subtraction of a large constant), 1 to subtract, and 1 to add. 

Starting from uv, I carry from x^8 to x^9, then from x^9 to x^10; then I eliminate
coefficients of x^10, x^11, ..., x^18; then I carry from x^0 to x^1, from x^1 to
x^2, ..., from x^7 to x^8, and once more from x^8 to x^9. Note that the coefficient
of x^9 is a multiple of 2^230, and is between -2^254 and 2^254 after subtraction of
its original high part, so the final carry from x^8 to x^9 produces reduced
coefficients. Overall there are 18 fp operations to eliminate 9 coefficients, and
44 fp operations for 11 carries. There are many other reasonable carry sequences;
on some CPUs it might be a good idea to have two parallel carry chains,
decreasing latency at the expense of an extra carry.
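The multiply, carry, and eliminate sequence just described can be sketched with exact Python integers standing in for fp numbers. This is an illustration under stated assumptions, not the paper's software: coefficient i is assumed to be a multiple of 2^ceil(25.5 i) (consistent with the remark that the coefficient of x^9 is a multiple of 2^230), the represented integer is the sum of the coefficients, and the helper names `e`, `to_poly`, `carry`, and `mul_reduce` are hypothetical.

```python
P = 2**255 - 19

def e(i):
    """Exponent ceil(25.5 * i); coefficient i is a multiple of 2**e(i) (assumed layout)."""
    return (51 * i + 1) // 2

def to_poly(n):
    """Split 0 <= n < 2**255 into 10 coefficients whose sum is n."""
    return [((n >> e(i)) & ((1 << (e(i + 1) - e(i))) - 1)) << e(i) for i in range(10)]

def carry(c, i):
    """Move the nearest multiple of 2**e(i+1) from coefficient i to coefficient i+1."""
    k = e(i + 1)
    high = ((c[i] + (1 << (k - 1))) >> k) << k
    c[i] -= high
    c[i + 1] += high

def mul_reduce(u, v):
    # Schoolbook product: 100 coefficient multiplications, degree at most 18.
    c = [0] * 19
    for i in range(10):
        for j in range(10):
            c[i + j] += u[i] * v[j]
    carry(c, 8)
    carry(c, 9)
    # Eliminate x^10..x^18 using x^10 = 19 * 2^-255; the division is exact
    # because coefficient 10+k is a multiple of 2**(255 + e(k)).
    for k in range(9):
        c[k] += 19 * c[10 + k] // 2**255
        c[10 + k] = 0
    for i in range(8):          # carries x^0 -> x^1, ..., x^7 -> x^8
        carry(c, i)
    carry(c, 8)                 # once more from x^8 to x^9
    return c[:10]
```

The loop structure mirrors the counts in the text: 9 eliminations and 2 + 8 + 1 = 11 carries. The sketch checks only that the represented value is preserved modulo 2^255 - 19 and that coefficients 0 through 8 end up within half of the next coefficient's unit; it does not model fp rounding or timing.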

Squaring is easier than general multiplication, because polynomial squaring 
is easier than general polynomial multiplication. Overall a squaring eliminates 
9^2 + 9 coefficient multiplications at the expense of 9 initial coefficient doublings;
note that doubling coefficients at the beginning is slightly better than doubling 
products later. Multiplication by a small constant is also easier than general 
multiplication, because the constant is represented by a polynomial of degree 0. 



Statistics: Each multiplication by a small constant takes 55 fp operations. 
Each squaring takes 162 fp operations. Each general multiplication takes 243 fp 
operations. Each iteration of the Curve25519 main loop has 1 multiplication by a 
small constant, using 55 fp operations; 4 squarings, using 648 fp operations; and 
5 general multiplications, using 1215 fp operations; in total 10 multiplications, 
using 1918 fp operations. The Curve25519 computation has 255 multiplications 
by small constants, using 14025 fp operations; 1274 squarings, using 206388 fp 
operations; and 1286 general multiplications, using 312498 fp operations; in total 
2815 multiplications, using 532911 fp operations. 

Note that the squaring-to-multiplication floating-point-operation ratio is only 
162/243 = 2/3, far below the 0.8 ratio often used in the literature for estimating 
the costs of elliptic-curve operations. 
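The multiplication statistics above can be re-derived with a short tally. The sketch below only recombines numbers quoted in the text; reading the squaring cost as "243 minus 9^2 + 9 plus 9" is one interpretation that reproduces the quoted 162, and the 55 for a small-constant multiplication is taken as given rather than derived.

```python
# fp-operation counts quoted in the text
MUL = 100 + 81 + 18 + 44        # products, sums, eliminations, carries
SQR = MUL - (9**2 + 9) + 9      # operations saved, plus 9 initial doublings
SMALL = 55                      # multiplication by a small constant (as quoted)

# One main-loop iteration: 1 small-constant mult, 4 squarings, 5 general mults.
per_iteration = 1 * SMALL + 4 * SQR + 5 * MUL

# Whole computation: 255 iterations plus the squarings and multiplications
# that do not belong to the main loop.
squarings = 1274
general = 1286
total = 255 * SMALL + squarings * SQR + general * MUL

print(MUL, SQR, per_iteration, total)
```

The counts 1274 = 4 · 255 + 254 and 1286 = 5 · 255 + 11 are consistent with 255 loop iterations plus the 254 squarings and 11 multiplications of the final inversion described in Section 5.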

Selecting integers. Consider the problem of computing x[b], where x[0], x[1]
are integers modulo 2^255 - 19 and b is an input-dependent bit. Using b as an array
index (without taking extra time for preloads, interrupt elimination, etc.) could
allow hyperthreading attacks and other cache-timing attacks; see [12, Sections
8-15]. I instead compute x[b] as (1 - b)x[0] + bx[1]. Similarly, if I need to compute
the pair (x[b], x[1 - b]), I compute (x[0] - b(x[0] - x[1]), x[1] + b(x[0] - x[1])).
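The two selection formulas can be written out directly. The sketch below shows only the arithmetic; Python integers are not constant-time, so this does not by itself provide the timing-attack protection discussed here, and the function names are hypothetical.

```python
P = 2**255 - 19

def select(b, x0, x1):
    """Return x[b] as (1 - b)*x0 + b*x1, with no data-dependent index or branch."""
    return ((1 - b) * x0 + b * x1) % P

def select_pair(b, x0, x1):
    """Return (x[b], x[1 - b]) using one shared masked difference."""
    d = b * (x0 - x1)
    return (x0 - d) % P, (x1 + d) % P
```

For b = 0 the pair comes back unchanged, and for b = 1 it comes back swapped; the paired form needs only one multiplication by b per coefficient instead of two.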

Statistics: Each iteration of the Curve25519 main loop has 2 fp operations 
inside computing b and 1 - b; 2 paired selections, taking 80 fp operations; and
2 more selections, taking 60 more fp operations. The total is 142 fp operations. 
The entire Curve25519 computation spends 36210 fp operations, about 6% of 
the total, on selection. Of course, these operations could be eliminated if timing 
attacks were not a concern. 

Why this field? CPUs include fast integer-multiplication circuits (usually 
buried inside fp-multiplication circuits aimed at the large fp market) but not 
circuits for fast multiplication of polynomials modulo 2. Characteristic-2 fields
allow several other speedups (see, e.g., [35, Section 3.4] and [25, Section 15.1]),
but I can’t see any way for them to set speed records on existing CPUs.

“Optimal extension fields,” such as degree-10 extensions of prime fields of 
size around 2^26, are advertised in [7] and [6] as allowing faster multiplication and
much faster inversion, perhaps so fast as to make affine-coordinate elliptic-curve 
computations faster than projective-coordinate elliptic-curve computations. My 
current assessment is that these fields have some slight advantages: there are no 
carry chains, so operations are easier to reorder; there are 10 reductions modulo 
a prime, rather than 11 carries, although one reduction is usually slightly more 
expensive than one carry; inversion is faster, although not fast enough to make 
affine coordinates worthwhile; and, most importantly, degree 9 might fit into 
64-bit fp. Unfortunately, these fields have a huge disadvantage: even if they are 
slightly faster on some CPUs, they are much slower on other CPUs. A 255-bit 
integer can be split into 4 or 8 or 10 or 12 pieces to accommodate the capabilities 
of different processors; an “optimal extension field” is tied to a particular number 
of pieces. 



So I selected a prime field. Prime fields also have the virtue of minimizing 
the number of security concerns for elliptic-curve cryptography; see, e.g., [29] 
and [22].

I chose my prime 2^255 - 19 according to the following criteria: primes as close
as possible to a power of 2 save time in field operations (as in, e.g., [9]), with no
effect on (conjectured) security level; primes slightly below 32k bits, for some
k, allow public keys to be easily transmitted in 32-bit words, with no serious
concerns regarding wasted space; k = 8 provides a comfortable security level. I
considered the primes 2^255 + 95, 2^255 - 19, 2^255 - 31, 2^254 + 79, 2^253 + 51, and
2^253 + 39, and selected 2^255 - 19 because 19 is smaller than 31, 39, 51, 79, 95.
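The "as close as possible to a power of 2" criterion can be explored with a probabilistic primality test. The sketch below uses a Miller-Rabin test with fixed bases, which is an assumption of this example (the paper does not say how the primes were found) and gives only probable primality for numbers of this size.

```python
def is_probable_prime(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Miller-Rabin test with fixed bases; probabilistic for large n."""
    if n < 2:
        return False
    for q in bases:
        if n % q == 0:
            return n == q
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False        # a is a witness that n is composite
    return True

def smallest_offset_below(bits):
    """Smallest c > 0 such that 2**bits - c is a probable prime."""
    c = 1
    while not is_probable_prime(2**bits - c):
        c += 1
    return c
```

Calling `smallest_offset_below(255)` searches downward from 2^255; the same helper can be pointed at the other candidates listed above.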

5 Fast Curve25519 computation 

This section explains fast x-coordinate point addition on my elliptic curve y^2 =
x^3 + 486662x^2 + x; explains fast x-coordinate scalar multiplication, i.e., fast
computation of Curve25519; and compares this curve to other elliptic curves.

Recall that Section 2 defines two x-coordinate functions. One function X_0
maps ∞ to 0; the other function X maps ∞ to ∞. Curve25519 is defined using
X_0, but inside the computation it is convenient to use X until the last moment.

Addition. Montgomery in [47, Section 10.3.1] published formulas to compute
X(2Q) given X(Q), and to compute X(Q + Q') given X(Q), X(Q'), X(Q - Q'),
assuming that Q ≠ ∞, Q' ≠ ∞, Q - Q' ≠ ∞, Q + Q' ≠ ∞. It turns out that
Montgomery’s formulas also work for ∞, provided that Q - Q' ∉ {∞, (0, 0)}, so
the Curve25519 computation can avoid checking for ∞. See Appendix B of this
paper.

Montgomery’s formulas represent each X value as a fraction x/z, replacing
divisions with multiplications. Montgomery commented that, when d is large,
one can perform d divisions in F_p at about the same cost as 4d multiplications
in F_p, so dividing x by z may be a good idea when there are many separate
elliptic-curve computations to perform at once; I have not implemented this
option yet.

The formula for X(2Q) involves 2 squarings, 1 multiplication by 121665 =
(486662 - 2)/4, and 2 more multiplications. The formula for X(Q + Q') involves 2
squarings and 3 more multiplications when z_1 in Theorem B.2, the denominator
of X(Q - Q'), is known to be 1; otherwise it involves 2 squarings and 4 more
multiplications. The Curve25519 computation always has z_1 = 1.

Scalar multiplication. Montgomery suggested using his formulas to obtain
X(nQ + Q), X(nQ), X(Q) given X(⌊n/2⌋Q + Q), X(⌊n/2⌋Q), X(Q): if n is even
then nQ = 2⌊n/2⌋Q and nQ + Q = (⌊n/2⌋Q + Q) + (⌊n/2⌋Q); if n is odd then
nQ + Q = 2(⌊n/2⌋Q + Q) and nQ = (⌊n/2⌋Q + Q) + (⌊n/2⌋Q). Either case
involves one doubling and one addition.

The formulas, repeated k times, produce X(nQ + Q), X(nQ), X(Q) with k
doublings and k additions starting from X(⌊n/2^k⌋Q + Q), X(⌊n/2^k⌋Q), X(Q). I
compute X(nQ) for any n ∈ 2^254 + 8{0, 1, ..., 2^251 - 1} with 255 doublings and
255 additions starting from X(Q), X(0), X(Q). The first and last few iterations
could be simplified.

The final X(nQ), like other X values, is represented as a fraction x/z. I
compute X_0(nQ) = x z^(p-2) using a straightforward sequence of 254 squarings
and 11 multiplications. This is about 7% of the Curve25519 computation. An
extended-Euclid inversion of z, randomized to protect against timing attacks,
might be faster, but the maximum potential speedup is very small, while the
cost in code complexity is large.
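One addition chain that reaches the exponent p - 2 = 2^255 - 21 with exactly 254 squarings and 11 multiplications is sketched below. It is a chain commonly seen in public Curve25519 implementations; whether it is step for step the sequence used in the software measured here is an assumption of this example.

```python
P = 2**255 - 19

def invert(z):
    """Compute z**(P - 2) mod P, counting squarings and multiplications."""
    count = {"sqr": 0, "mul": 0}

    def sqr(a, times=1):
        for _ in range(times):
            a = a * a % P
            count["sqr"] += 1
        return a

    def mul(a, b):
        count["mul"] += 1
        return a * b % P

    z2 = sqr(z)                        # z^2
    z9 = mul(sqr(z2, 2), z)            # z^9
    z11 = mul(z9, z2)                  # z^11
    t5 = mul(sqr(z11), z9)             # z^(2^5 - 1)
    t10 = mul(sqr(t5, 5), t5)          # z^(2^10 - 1)
    t20 = mul(sqr(t10, 10), t10)       # z^(2^20 - 1)
    t40 = mul(sqr(t20, 20), t20)       # z^(2^40 - 1)
    t50 = mul(sqr(t40, 10), t10)       # z^(2^50 - 1)
    t100 = mul(sqr(t50, 50), t50)      # z^(2^100 - 1)
    t200 = mul(sqr(t100, 100), t100)   # z^(2^200 - 1)
    t250 = mul(sqr(t200, 50), t50)     # z^(2^250 - 1)
    result = mul(sqr(t250, 5), z11)    # z^(2^255 - 32 + 11) = z^(P - 2)
    return result, count["sqr"], count["mul"]
```

The squaring counts add up as 1 + 2 + 1 + 5 + 10 + 20 + 10 + 50 + 100 + 50 + 5 = 254, and every step is independent of the value of z, which is what makes the fixed exponentiation attractive from a timing standpoint.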

Theorems B.1 and B.2 justify the above procedure if X_0(Q) ≠ 0. The same
formulas also work for X_0(Q) = 0: every computed fraction has denominator 0,
so the final output is 0 as desired.
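The scalar-multiplication procedure of this section can be sketched as follows, using the doubling and differential-addition formulas of Theorems B.1 and B.2 with z_1 = 1. This is a readability-first illustration: the function names are hypothetical, the conditional swap is written as an ordinary branch rather than the arithmetic selection of Section 4, and Python's built-in `pow` stands in for the final inversion.

```python
P = 2**255 - 19
A = 486662
A24 = (A - 2) // 4   # 121665

def double(x, z):
    """Theorem B.1: X(2Q) = x2/z2 from X(Q) = x/z."""
    s, d = (x + z) ** 2 % P, (x - z) ** 2 % P
    return s * d % P, (s - d) * (s + A24 * (s - d)) % P

def diff_add(x, z, xp, zp, x1):
    """Theorem B.2 with z1 = 1: X(Q + Q') from X(Q), X(Q'), X(Q - Q') = x1."""
    u, v = (x - z) * (xp + zp) % P, (x + z) * (xp - zp) % P
    return (u + v) ** 2 % P, (u - v) ** 2 * x1 % P

def curve25519_x(n, x1, bits=255):
    """Return X_0(nQ) for a point Q with X_0(Q) = x1, scanning `bits` bits of n."""
    lo, hi = (1, 0), (x1, 1)           # X(0Q) = infinity, X(1Q) = x1
    for i in reversed(range(bits)):
        b = (n >> i) & 1
        if b:                           # branch kept for readability; not constant-time
            lo, hi = hi, lo
        lo, hi = double(*lo), diff_add(*lo, *hi, x1)
        if b:
            lo, hi = hi, lo
    x, z = lo
    return x * pow(z, P - 2, P) % P     # X_0 = x * z^(p-2)
```

Each iteration performs one doubling and one addition, and the pair (lo, hi) always represents X(mQ) and X(mQ + Q) for the prefix m of n read so far. A simple consistency check is that `curve25519_x(a, curve25519_x(b, 9))` and `curve25519_x(b, curve25519_x(a, 9))` agree, which is the shared-secret property from the introduction.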

Other addition chains. Montgomery pointed out that one can replace the
addition chain {⌊n/2^k⌋} ∪ {⌊n/2^k⌋ + 1} with any differential addition chain (any
“Lucas chain”), i.e., any addition chain where each sum is already accompanied
by a difference. One can find such a chain with only about 384 elements, as
discussed in [59, Section 5]. On the other hand, most of the additions then
require z_1 ≠ 1 in Theorem B.2, costing extra multiplications in F_p. It is also
not clear how easily these addition chains can be protected against cache-timing
attacks. Further investigation is required.

A more common strategy is to drop the difference requirement, compensate
by computing more coordinates of each multiple of Q (Jacobian coordinates,
for example, or Chudnovsky coordinates), and use an addition chain with only
about 320 elements. See, e.g., [17] or [4]. Unfortunately, even if A is selected
so that y^2 = x^3 + Ax^2 + x is isomorphic to a curve y^2 = x^3 - 3x - a_6, each
doubling in known coordinate systems takes at least 8 field multiplications, and
each general addition takes even more. All of my experiments with this strategy
have ended up using more field operations, more floating-point operations, and
more cycles than the x-coordinate strategy.

One can save a large fraction of the time for computing Curve25519(n, q)
when q is fixed (in particular, for computing public keys Curve25519(n, 9))
by precomputing various multiples of (q, ...). An essentially optimal algorithm,
published by Pippenger in [52] in 1976, computes u public keys with only about
256/lg 8u additions per key. This speedup is negligible in the Diffie-Hellman
context (and is not provided by my current software), since each key is used
many times; but the speedup is useful for other applications of elliptic curves.

Why this curve? I chose the curve shape y 2 = x 3 + Ax 2 + x, as suggested 
by Montgomery, to allow extremely fast x-coordinate point operations. Curves 
of this shape have order divisible by 4, requiring a marginally larger prime for 
the same conjectured security level, but this is outweighed by the extra speed 
of curve operations. I selected (A - 2)/4 as a small integer, as suggested by
Montgomery, to speed up the multiplication by (A - 2)/4; this has no effect on
the conjectured security level. 

To protect against various attacks discussed in Section 3, I rejected choices
of A whose curve and twist orders were not {4 · prime, 8 · prime}; here 4, 8 are
minimal since p ∈ 1 + 4Z. The smallest positive choices for A are 358990, 464586,
and 486662. I rejected A = 358990 because one of its primes is slightly smaller
than 2^252, raising the question of how standards and implementations should
handle the theoretical possibility of a user’s secret key matching the prime;
discussing this question is more difficult than switching to another A. I rejected
464586 for the same reason. So I ended up with A = 486662.

Special curves with small complex automorphisms have potential benefits, 
as discussed in [31], and are worth further investigation, but so far I have not 
succeeded in saving time using them. 
A Appendix: rings, fields, and curves

This appendix reviews elliptic curves at the level of generality of Theorem 2.1. 
See [24, Chapter 13] for much more information about elliptic curves. 

The base field. Let p be a prime number with p > 5. Define F_p as the set
{0, 1, ..., p - 1}. Define a binary operation + on F_p as addition mod p. Define
a binary operation · on F_p as multiplication mod p. Define a unary operation -
on F_p as negation mod p.

F_p is a commutative ring under 0, 1, -, +, ·. This means that it satisfies every
0, 1, -, +, · identity satisfied by Z; e.g., the identity a(b + c + 1) = ab + ac + a.
Furthermore, because p is prime, F_p is a field: every nonzero element of F_p has
a reciprocal in F_p.

Squares in the base field. Squaring is a 2-to-1 map on the nonzero elements
of F_p, so there are exactly (p - 1)/2 non-squares in F_p. Find the smallest δ ∈
{1, 2, ..., p - 1} such that δ is not a square in F_p.

Fermat’s little theorem implies that a^((p-1)/2) = 1 if a is a nonzero square
in F_p; a^((p-1)/2) = -1 if a is a non-square in F_p; and a^((p-1)/2) = 0 if a = 0.
Consequently, if a is a non-square in F_p, then a/δ is a nonzero square in F_p.
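The square test above translates directly into code. The sketch below specializes to p = 2^255 - 19 and A = 486662, which is an assumption of the example (this appendix is stated for general p and A), and the function names are hypothetical.

```python
P = 2**255 - 19
A = 486662

def euler(a, p=P):
    """Return a**((p-1)/2) mod p, mapped to 1, -1, or 0."""
    r = pow(a, (p - 1) // 2, p)
    return -1 if r == p - 1 else r

def smallest_non_square(p=P):
    """Smallest d in {1, 2, ..., p - 1} that is not a square modulo p."""
    d = 1
    while euler(d, p) != -1:
        d += 1
    return d
```

For this prime the search stops immediately at 2, since p leaves remainder 5 on division by 8. The same `euler` helper checks the hypothesis on A used below, namely that A^2 - 4 is not a square modulo p.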

The extension field. Define F_{p^2} as the set F_p × F_p. Define a unary operation
- on F_{p^2} by -(c, d) = (-c, -d). Define a binary operation + on F_{p^2} by
(a, b) + (c, d) = (a + c, b + d). Define a binary operation · on F_{p^2} by
(a, b) · (c, d) = (ac + δbd, ad + bc).

F_{p^2} is a commutative ring under 0, 1, -, +, ·. Furthermore, each nonzero
(a, b) ∈ F_{p^2} has a reciprocal (a/(a^2 - δb^2), -b/(a^2 - δb^2)) ∈ F_{p^2}.

The injection a ↦ (a, 0) from F_p to F_{p^2} is a ring morphism: it preserves
0, 1, -, +, ·. Thus (a, 0) is abbreviated a without risk of confusion. The element
(0, 1) of F_{p^2} is abbreviated √δ; it satisfies (√δ)^2 = (δ, 0) = δ.
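The pair arithmetic of the extension field can be sketched as follows. The example fixes p = 2^255 - 19 and δ = 2 as assumptions (δ = 2 being the smallest non-square for this p), and the function names are hypothetical.

```python
P = 2**255 - 19
DELTA = 2   # assumed: the smallest non-square modulo P

def fp2_add(u, v):
    """(a, b) + (c, d) = (a + c, b + d)."""
    return ((u[0] + v[0]) % P, (u[1] + v[1]) % P)

def fp2_mul(u, v):
    """(a, b) * (c, d) = (ac + DELTA*bd, ad + bc)."""
    (a, b), (c, d) = u, v
    return ((a * c + DELTA * b * d) % P, (a * d + b * c) % P)

def fp2_inv(u):
    """Reciprocal (a/(a^2 - DELTA*b^2), -b/(a^2 - DELTA*b^2)) of a nonzero (a, b)."""
    a, b = u
    n = pow(a * a - DELTA * b * b, P - 2, P)   # reciprocal of the norm in F_p
    return (a * n % P, -b * n % P)
```

With these definitions the element (0, 1) squares to (DELTA, 0), matching the abbreviation √δ, and the norm a^2 - δb^2 is nonzero for nonzero (a, b) precisely because δ is not a square.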



The elliptic curve. Let A be an integer such that A^2 - 4 mod p is not a square
in F_p. Define E(F_{p^2}) as {∞} ∪ {(x, y) ∈ F_{p^2} × F_{p^2} : y^2 = x^3 + Ax^2 + x}.

Define a unary operation - on E(F_{p^2}) as follows: -∞ = ∞; -(x, y) =
(x, -y). Define a binary operation + on E(F_{p^2}) as follows:

• ∞ + ∞ = ∞.

• ∞ + (x, y) = (x, y).

• (x, y) + ∞ = (x, y).

• (x, y) + (x, -y) = ∞.

• If y ≠ 0 then (x, y) + (x, y) = (x'', y'') where λ = (3x^2 + 2Ax + 1)/2y,
x'' = λ^2 - A - 2x = (x^2 - 1)^2/4y^2, and y'' = λ(x - x'') - y. Here / refers to
division in F_{p^2}.

• If x' ≠ x then (x, y) + (x', y') = (x'', y'') where λ = (y' - y)/(x' - x),
x'' = λ^2 - A - x - x', and y'' = λ(x - x'') - y.

Standard (although lengthy) calculations show that E(F_{p^2}) is a commutative
group under ∞, -, +. This means that every 0, -, + identity satisfied by Z is
also satisfied by E(F_{p^2}) when 0 is replaced by ∞.
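The group operations can be exercised numerically on points with both coordinates in F_p. The sketch below is restricted to that subgroup and to p = 2^255 - 19, A = 486662, both assumptions of the example; the square-root helper relies on p leaving remainder 5 on division by 8, and all function names are hypothetical.

```python
P = 2**255 - 19
A = 486662
INF = None   # stands for the point at infinity

def inv(a):
    return pow(a, P - 2, P)

def neg(Q):
    return INF if Q is INF else (Q[0], -Q[1] % P)

def add(Q, R):
    if Q is INF:
        return R
    if R is INF:
        return Q
    (x, y), (xp, yp) = Q, R
    if x == xp and (y + yp) % P == 0:
        return INF                                   # (x, y) + (x, -y)
    if x == xp:                                      # doubling, y != 0
        lam = (3 * x * x + 2 * A * x + 1) * inv(2 * y) % P
    else:                                            # x' != x
        lam = (yp - y) * inv(xp - x) % P
    x3 = (lam * lam - A - x - xp) % P
    return (x3, (lam * (x - x3) - y) % P)

def sqrt_mod_p(a):
    """Square root for P = 5 (mod 8); returns None if a is a non-square."""
    r = pow(a, (P + 3) // 8, P)
    if r * r % P != a % P:
        r = r * pow(2, (P - 1) // 4, P) % P          # multiply by a square root of -1
    return r if r * r % P == a % P else None
```

Starting from x = 9 and a square root y of 9^3 + A · 9^2 + 9, one can confirm on examples that doubling matches the closed form (x^2 - 1)^2/4y^2 and that sums such as (G + G) + G and G + (G + G) coincide. Such spot checks illustrate the group law; they do not replace the calculations referred to above.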

Note that the following three sets are subgroups of E(F_{p^2}):

• {∞, (0, 0)}. Indeed, ∞ + ∞ = ∞; (0, 0) + (0, 0) = ∞; and (0, 0) + ∞ = (0, 0).

• {∞} ∪ (E(F_{p^2}) ∩ (F_p × F_p)). Indeed, if x, y, x', y' ∈ F_p then the quantities
λ, x'', y'' defined above are in F_p.

• {∞} ∪ (E(F_{p^2}) ∩ (F_p × √δ F_p)). This time λ is a ratio of an element of F_p
and an element of √δ F_p, and is therefore an element of √δ F_p, producing
x'' ∈ F_p and y'' ∈ √δ F_p.

Note also that if x^3 + Ax^2 + x = 0 in F_p then x = 0. (Otherwise A^2 - 4 =
(x - 1/x)^2 in F_p, so A^2 - 4 mod p is a square in F_p, contradiction.) In other
words, (x, 0) ∉ E(F_{p^2}) if x ≠ 0.

Proof of Theorem 2.1. Let n be an integer. Let q be an element of F_p. Define
α = q^3 + Aq^2 + q. Define X_0 : E(F_{p^2}) → F_{p^2} as follows: X_0(∞) = 0;
X_0(x, y) = x.

I will show that there are exactly two Q ∈ E(F_{p^2}) such that X_0(Q) = q, that
both of them have the same value of X_0(nQ), and that the value is in F_p. Here
nQ means the nth multiple of Q under the above group operations on E(F_{p^2}).

Case 1: α = 0. Then q = 0. The only square root of 0 in F_{p^2} is 0, so
{Q ∈ E(F_{p^2}) : X_0(Q) = q} is exactly the group {∞, (0, 0)}. Thus each Q ∈
E(F_{p^2}) with X_0(Q) = q has nQ ∈ {∞, (0, 0)}; i.e., X_0(nQ) = 0.

Case 2: α is a nonzero square in F_p. Select a square root r. Now q ≠ 0, and the
only square roots of q^3 + Aq^2 + q in F_{p^2} are ±r, so {Q ∈ E(F_{p^2}) : X_0(Q) = q} =
{(q, r), (q, -r)}. Define s = X_0(n(q, r)). The group {∞} ∪ (E(F_{p^2}) ∩ (F_p × F_p))
contains (q, r), so it contains n(q, r), so s ∈ {0, 1, 2, ..., p - 1}. Furthermore
n(q, -r) = n(-(q, r)) = -n(q, r), so X_0(n(q, -r)) = X_0(n(q, r)) = s. Thus
X_0(nQ) = s for all Q ∈ E(F_{p^2}) such that X_0(Q) = q.

Case 3: α is a non-square in F_p. Then α/δ is a nonzero square in F_p. Select
a square root r. Now q ≠ 0, and the only square roots of q^3 + Aq^2 + q in
F_{p^2} are ±r√δ, so {Q ∈ E(F_{p^2}) : X_0(Q) = q} = {(q, r√δ), (q, -r√δ)}. Define
s = X_0(n(q, r√δ)). The group {∞} ∪ (E(F_{p^2}) ∩ (F_p × √δ F_p)) contains (q, r√δ),
so it contains n(q, r√δ), so s ∈ {0, 1, 2, 3, ..., p - 1}. Furthermore n(q, -r√δ) =
n(-(q, r√δ)) = -n(q, r√δ), so X_0(n(q, -r√δ)) = X_0(n(q, r√δ)) = s. Thus
X_0(nQ) = s for all Q ∈ E(F_{p^2}) such that X_0(Q) = q. □

B Appendix: Montgomery’s double-and-add formulas 

This appendix states Montgomery’s x-coordinate double-and-add formulas, and
proves that the formulas work whenever Q - Q' ∉ {∞, (0, 0)}.

The following diagram summarizes Montgomery’s formulas in the case z_1 = 1.
As in Theorems B.1 and B.2, x/z and x'/z' are the x-coordinates of points Q, Q';
x_2/z_2 is the x-coordinate of 2Q; x_1 is the x-coordinate of Q - Q'; and x_3/z_3 is
the x-coordinate of Q + Q'.

[Diagram: double-and-add data flow with inputs x, z, x', z'.]



One can see at a glance that there are 4 squarings, 1 multiplication by (A — 2)/4, 
and 5 other multiplications; and that there are 8 additions/subtractions, none 
of which produce input to another addition/subtraction. 

Theorem B.1. Let p be a prime number with p > 5. Let A be an integer such
that A^2 - 4 is not a square modulo p. Define E as the elliptic curve y^2 = x^3 +
Ax^2 + x over the field F_p. Define X : E(F_{p^2}) → {∞} ∪ F_{p^2} as follows: X(∞) =
∞; X(x, y) = x. Fix x, z ∈ F_p with (x, z) ≠ (0, 0). Define

    x_2 = (x^2 - z^2)^2 = (x - z)^2 (x + z)^2,

    z_2 = 4xz(x^2 + Axz + z^2)
        = ((x + z)^2 - (x - z)^2) ((x + z)^2 + ((A - 2)/4)((x + z)^2 - (x - z)^2)).

Then X(2Q) = x_2/z_2 for all Q ∈ E(F_{p^2}) such that X(Q) = x/z.

Here x/z means the quotient of x and z in F_p if z ≠ 0; it means ∞ if x ≠ 0
and z = 0; it is undefined if x = z = 0.

Proof. Case 1: z = 0. Then x_2 = x^4 ≠ 0 and z_2 = 0. Also X(Q) = x/0 = ∞ so
Q = ∞ so 2Q = ∞ so X(2Q) = ∞ = x_2/0 = x_2/z_2.

Case 2: z ≠ 0 and x = 0. Then x_2 = z^4 ≠ 0 and z_2 = 0. Also X(Q) = 0/z = 0
so Q = (0, 0) so 2Q = ∞ so X(2Q) = ∞ = x_2/0 = x_2/z_2.

Case 3: z ≠ 0 and x ≠ 0. Then Q = (x/z, y) for some y ∈ F_{p^2} satisfying
y^2 = (x/z)^3 + A(x/z)^2 + (x/z) and thus 4y^2 z^4 = 4(x^3 z + Ax^2 z^2 + xz^3) = z_2.
The non-squareness of A^2 - 4 implies that y ≠ 0; hence z_2 ≠ 0. Also X(2Q) =
((x/z)^2 - 1)^2/4y^2 by definition of doubling; thus z_2 X(2Q) = z^4((x/z)^2 - 1)^2 =
(x^2 - z^2)^2 = x_2. □

Theorem B.2. In the context of Theorem B.1, fix x, z, x', z', x_1, z_1 ∈ F_p with
(x, z) ≠ (0, 0), (x', z') ≠ (0, 0), x_1 ≠ 0, and z_1 ≠ 0. Define

    x_3 = 4(xx' - zz')^2 z_1 = ((x - z)(x' + z') + (x + z)(x' - z'))^2 z_1,

    z_3 = 4(xz' - zx')^2 x_1 = ((x - z)(x' + z') - (x + z)(x' - z'))^2 x_1.

Then X(Q + Q') = x_3/z_3 for all Q, Q' ∈ E(F_{p^2}) such that X(Q) = x/z, X(Q') =
x'/z', and X(Q - Q') = x_1/z_1.

Proof. Case 1: Q = Q'. Then X(Q - Q') = X(∞) = ∞, so z_1 = 0, contradiction.

Case 2: Q = ∞. Then z = 0 and x ≠ 0; also X(Q - Q') = X(-Q') = X(Q'),
so x_1/z_1 = x'/z', so x' ≠ 0 and z' ≠ 0. Finally x_3 = 4(xx')^2 z_1 and z_3 = 4(xz')^2 x_1
so x_3/z_3 = (x'/z')^2 z_1/x_1 = x'/z' = X(Q') = X(Q + Q').

Case 3: Q' = ∞. Then z' = 0 and x' ≠ 0; also X(Q - Q') = X(Q), so
x_1/z_1 = x/z, so x ≠ 0 and z ≠ 0. Finally x_3 = 4(xx')^2 z_1 and z_3 = 4(zx')^2 x_1 so
x_3/z_3 = (x/z)^2 z_1/x_1 = x/z = X(Q) = X(Q + Q').

Case 4: Q = -Q'. Then X(Q') = X(Q) so x/z = x'/z' so xz' = zx' so
z_3 = 0.

Suppose that x_3 = 0. Then (x - z)(x' + z') + (x + z)(x' - z') = 0 and
(x - z)(x' + z') - (x + z)(x' - z') = 0, so (x - z)(x' + z') = 0 and (x + z)(x' - z') = 0.
If x + z ≠ 0 then x' - z' = 0 so x' + z' = 2x' ≠ 0 so x - z = 0; i.e., X(Q) = 1 and
X(Q') = 1. Otherwise x = -z so x - z = 2x ≠ 0 so x' = -z'; i.e., X(Q) = -1
and X(Q') = -1. Either way X(Q - Q') = X(2Q) = (X(Q)^2 - 1)^2/... =
(1 - 1)^2/... = 0 by definition of doubling, so x_1 = 0, contradiction.

Thus x_3 ≠ 0, and x_3/z_3 = ∞ = X(∞) = X(Q + Q').

Case 5: Q ≠ ∞; Q' ≠ ∞; Q ≠ Q'; and Q ≠ -Q'. Then z ≠ 0, z' ≠ 0,
and x/z ≠ x'/z', so z_3 ≠ 0. Find y, y' ∈ F_{p^2} such that Q = (x/z, y) and
Q' = (x'/z', y'). Write α = x'/z' - x/z and β = A + x/z + x'/z'. Then X(Q + Q') =
((y' - y)/α)^2 - β and X(Q - Q') = ((-y' - y)/α)^2 - β by definition of Q ± Q',
so X(Q + Q')X(Q - Q') = β^2 - 2β((y')^2 + y^2)/α^2 + ((y')^2 - y^2)^2/α^4. Substitute
y^2 = (x/z)^3 + A(x/z)^2 + (x/z) and (y')^2 = (x'/z')^3 + A(x'/z')^2 + (x'/z') and
simplify to see that X(Q + Q')X(Q - Q') = (xx' - zz')^2/(xz' - x'z)^2; this is what
Montgomery did. Finally X(Q + Q') = (xx' - zz')^2 z_1/(xz' - x'z)^2 x_1 = x_3/z_3. □



